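Beyond Words: Understanding Tokenization and the Lollipop Test

The Hidden Architecture Behind Language

Large language models (LLMs) do not "read" text the way humans do. Where we see letters and words, a model processes text as a sequence of chunks called tokens, each mapped to a numeric ID. Understanding this abstraction is the first step toward mastering prompt engineering and system design.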

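The Lollipop Test

Why does a language model struggle to reverse the letters in the word "lollipop", yet usually succeed when asked to reverse "l-o-l-l-i-p-o-p"?

The problem: In its standard spelling, the model does not see the word letter by letter. The tokenizer maps it to one or a few multi-letter tokens, so the model has no clear view of the individual letters inside each token.
The solution: Separating the letters with hyphens forces the tokenizer to treat each letter as its own token, giving the model the fine-grained "view" it needs to complete the task.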
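
The effect can be illustrated with a toy tokenizer. This is only a sketch: the vocabulary `TOY_VOCAB` and the greedy longest-match rule in `toy_tokenize` are assumptions made up for this example, and a real tokenizer may split "lollipop" differently. The point is that the plain word comes out as multi-letter chunks, while the hyphenated spelling comes out one letter at a time.

```python
# Hypothetical vocabulary, invented for illustration only.
# Real tokenizers learn vocabularies with tens of thousands of entries.
TOY_VOCAB = {"l", "o", "i", "p", "-", "oll", "ipop"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Character not in the vocabulary: keep it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


plain = toy_tokenize("lollipop")
spelled = toy_tokenize("l-o-l-l-i-p-o-p")
print(plain)    # ['l', 'oll', 'ipop']
print(spelled)  # every letter and hyphen is a separate token
```

With the plain spelling, the letters "o", "l", "l" are hidden inside the single piece "oll"; with hyphens, each letter is exposed as its own unit.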

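Core Principles

Token ratio: As a rule of thumb, 1 token in English is about 4 characters, or about 0.75 words.
Context window: A model has a fixed "memory" size, measured in tokens (for example, 4096 tokens). This limit covers both your instructions and the model's response.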
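
The two principles combine into a simple budget check. The sketch below is a rough estimate only: `estimate_tokens` and `fits_context` are hypothetical helper names, the 4-characters-per-token figure is the rule of thumb above rather than a real tokenizer count, and 4096 is just the example window size.

```python
CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact figure


def estimate_tokens(text):
    """Rough token count from the character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_context(prompt, max_response_tokens, context_window=4096):
    """Check that the estimated prompt plus the reserved response fits the window."""
    return estimate_tokens(prompt) + max_response_tokens <= context_window


prompt = "Summarize the following report in three bullet points. " * 100
print(estimate_tokens(prompt), fits_context(prompt, max_response_tokens=500))
```

Reserving room for the response matters because the window is shared: a prompt that fills the whole window leaves the model no space to answer.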

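Base Models vs. Instruction-Tuned Models

Base LLM: Predicts the most likely next word based on a massive training dataset. For example, given "What is the capital of France?", it might continue with "What is the capital of Germany?" rather than an answer.
Instruction-tuned LLM: Fine-tuned, often with reinforcement learning from human feedback (RLHF), to follow specific instructions and act as an assistant.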

Question 1

If you are processing a document that is 3,000 English characters long, roughly how many tokens will the model consume?

A) 3,000 tokens

B) 750 tokens

C) 12,000 tokens

Question 2

Why is an Instruction-Tuned LLM preferred over a Base LLM for building a chatbot?

A) It is faster at generating text.

B) It uses fewer tokens.

C) It is trained to follow specific tasks and dialogue formats.

Challenge: Token Estimation

Apply the token ratio rule to a real-world scenario.

You are designing an automated summarization system. The system receives daily reports that average 10,000 characters in length.

Your API provider charges $0.002 per 1,000 tokens.

Step 1

Estimate the number of tokens for a single daily report.

Solution:
Using the rule of thumb (1 token ≈ 4 characters):
$$ \text{Tokens} = \frac{10{,}000}{4} = 2{,}500 \text{ tokens} $$

Step 2

Calculate the estimated cost to process one daily report.

Solution:
The cost is $0.002 per 1,000 tokens.
$$ \text{Cost} = \left( \frac{2{,}500}{1{,}000} \right) \times 0.002 = 2.5 \times 0.002 = \$0.005 $$
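
The two steps can also be written as a small helper, which makes it easy to rerun the estimate for other report sizes or prices. This is a minimal sketch: `estimate_cost` is a hypothetical function name, and the 4-characters-per-token ratio is the same rule of thumb used above, so the result is an approximation rather than a billed amount.

```python
def estimate_cost(num_chars, price_per_1k_tokens, chars_per_token=4):
    """Return (estimated tokens, estimated cost in dollars) for one document."""
    tokens = num_chars / chars_per_token          # Step 1: characters -> tokens
    cost = tokens / 1000 * price_per_1k_tokens    # Step 2: tokens -> dollars
    return tokens, cost


tokens, cost = estimate_cost(10_000, 0.002)
print(f"{tokens:.0f} tokens, ${cost:.4f} per report")  # 2500 tokens, $0.0050 per report
```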